Synthetic caption
Improving multimodal datasets with image captioning
Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better on Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity. The synthetic captions used in our experiments are now available on HuggingFace.
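The caption-mixing idea lends itself to a very small data-pipeline sketch. The following is only an illustration, assuming precomputed CLIP similarity scores for both caption sources; the threshold value and field names are assumptions, not the paper's exact recipe:

```python
# A minimal sketch of one raw/synthetic caption mixing strategy, assuming
# precomputed CLIP similarity scores; the threshold and field names are
# illustrative, not the paper's exact recipe.
from dataclasses import dataclass
from typing import Optional

@dataclass
class WebSample:
    image_path: str
    raw_caption: str
    synthetic_caption: str       # e.g. from a captioning model (assumed)
    clip_score_raw: float        # CLIP image-text similarity of raw caption
    clip_score_synthetic: float  # same, for the generated caption

def select_caption(s: WebSample, threshold: float = 0.3) -> Optional[str]:
    """Keep the raw caption when it is well-aligned with the image;
    otherwise fall back to the synthetic caption if that one passes.
    Returning None drops the sample entirely."""
    if s.clip_score_raw >= threshold:
        return s.raw_caption
    if s.clip_score_synthetic >= threshold:
        return s.synthetic_caption
    return None

pool = [
    WebSample("img0.jpg", "IMG_4032.JPG", "a dog catching a frisbee", 0.12, 0.41),
    WebSample("img1.jpg", "red 1967 mustang at a car show", "a car", 0.38, 0.22),
]
training_pairs = [(s.image_path, c) for s in pool if (c := select_caption(s))]
print(training_pairs)  # img0 keeps the synthetic caption, img1 the raw one
```

The point of the fallback (rather than plain filtering) is that it preserves the image for training even when its alt-text is nondescript, which is exactly the diversity that hard filtering discards.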
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Fiastre, Gabriel, Yang, Antoine, Schmid, Cordelia
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities by leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking, and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
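The synthetic-annotation step behind LVISCap/LV-VISCap can be pictured as: highlight one object's trajectory and ask a VLM to describe it. The sketch below is an assumed reading of that step, not the paper's pipeline; `query_vlm` and `highlight` are hypothetical stand-ins:

```python
# A minimal sketch of VLM-based trajectory captioning. `query_vlm` is a
# hypothetical stand-in for a state-of-the-art VLM call; the highlighting
# and keyframe selection are assumptions, not the paper's exact method.
import numpy as np

def query_vlm(frames: list[np.ndarray], prompt: str) -> str:
    """Stand-in for a VLM call (API or local model)."""
    return "a brown dog running across the lawn"  # placeholder output

def highlight(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Dim everything outside the object's mask so the VLM attends to it."""
    out = frame.astype(np.float32)
    out[~mask] *= 0.3
    return out.astype(np.uint8)

def caption_trajectory(frames, masks, n_keyframes: int = 4) -> str:
    # Sample a few evenly spaced keyframes along the trajectory.
    idx = np.linspace(0, len(frames) - 1, n_keyframes).astype(int)
    keyframes = [highlight(frames[i], masks[i]) for i in idx]
    return query_vlm(
        keyframes,
        "Describe the highlighted object and what it does across these frames.",
    )

# Toy trajectory: 8 frames of 64x64 RGB with a fixed square mask.
frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(8)]
mask = np.zeros((64, 64), dtype=bool); mask[16:48, 16:48] = True
print(caption_trajectory(frames, [mask] * 8))
```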
MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
Dufour, Nicolas, Degeorge, Lucas, Ghosh, Arijit, Kalogeiton, Vicky, Picard, David
Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data, together with optimizing for a single reward, tends to harm diversity, semantic fidelity, and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and on user-preference scores (PickScore, ImageReward, HPSv2).
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
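Multi-reward conditioning of this kind can be sketched as an extra learned embedding added to the usual conditioning signal. The layer sizes and the way the embedding is fused below are assumptions, not MIRO's actual architecture:

```python
# A minimal sketch of conditioning a generator on multiple reward scores.
# Layer sizes and the additive fusion are assumptions, not MIRO's design.
import torch
import torch.nn as nn

class MultiRewardEmbedding(nn.Module):
    """Map a vector of per-sample reward scores (e.g. aesthetic score,
    preference score, CLIP alignment) to a conditioning embedding that is
    added to the usual text/timestep conditioning."""
    def __init__(self, n_rewards: int, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_rewards, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, rewards: torch.Tensor) -> torch.Tensor:
        return self.mlp(rewards)

dim, n_rewards = 256, 3
embed = MultiRewardEmbedding(n_rewards, dim)

# During training: score each image with the frozen reward models and feed
# the scores in; at sampling time, condition on high target rewards instead.
batch_rewards = torch.tensor([[0.9, 0.8, 0.95], [0.2, 0.4, 0.3]])
text_cond = torch.randn(2, dim)          # stand-in for text embeddings
cond = text_cond + embed(batch_rewards)  # fused conditioning signal
print(cond.shape)  # torch.Size([2, 256])
```

Because the reward scores are inputs rather than a filtering criterion, no training data is discarded; low-reward images still teach the model what "low reward" looks like.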
MobileCLIP2: Improving Multi-Modal Reinforced Training
Faghri, Fartash, Vasu, Pavan Kumar Anasosalu, Koc, Cem, Shankar, Vaishaal, Toshev, Alexander, Tuzel, Oncel, Pouransari, Hadi
Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency, lightweight architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe a 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with the MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2x smaller and improves on DFN ViT-L/14 at 2.5x lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
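The temperature-tuning finding can be illustrated with a minimal contrastive distillation loss: a KL divergence between teacher and student image-to-text similarity distributions, each softened by its own temperature. This is only a sketch of the general technique; the exact losses and teacher weighting in MobileCLIP2 may differ:

```python
# A minimal sketch of contrastive knowledge distillation with tunable
# temperatures. Temperatures and dimensions are illustrative.
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_img, student_txt, teacher_img, teacher_txt,
                        tau_teacher=0.7, tau_student=1.0):
    """KL between teacher and student image-to-text similarity distributions
    over the batch. Embeddings are L2-normalized; tau_* control how sharp
    each distribution is, which is the knob the ablations tune."""
    s = F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).T
    t = F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).T
    log_p_student = F.log_softmax(s / tau_student, dim=-1)
    p_teacher = F.softmax(t / tau_teacher, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Toy batch: student and teacher may live in different embedding dims.
B, d_student, d_teacher = 8, 256, 768
loss = contrastive_kd_loss(torch.randn(B, d_student), torch.randn(B, d_student),
                           torch.randn(B, d_teacher), torch.randn(B, d_teacher))
print(loss.item())
```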
Mining Contextualized Visual Associations from Images for Creativity Understanding
Sahu, Ananya, Ananthram, Amith, McKeown, Kathleen
Understanding another person's creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal, alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high-quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7M creative captions for the images in MS-COCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.
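The mining-to-caption step can be pictured as prompting a model at increasing levels of abstraction, grounded in the mined associations. The prompt wording and the `query_vlm` stub below are assumptions, not the paper's released generation code:

```python
# A minimal sketch of turning mined associations into captions at three
# abstraction levels. `query_vlm` is a hypothetical stand-in for a VLM call.
def query_vlm(prompt: str) -> str:
    return "a lone lighthouse standing against the storm"  # placeholder

def creative_captions(element: str, associations: list[str]) -> dict[int, str]:
    """Generate captions for one salient visual element at three levels:
    literal, associative, and abstract/metaphorical."""
    assoc = ", ".join(associations)
    prompts = {
        0: f"Describe the {element} literally.",
        1: f"Describe the {element}, weaving in: {assoc}.",
        2: f"Write an abstract, metaphorical caption evoking {element} "
           f"through: {assoc}.",
    }
    return {level: query_vlm(p) for level, p in prompts.items()}

print(creative_captions("lighthouse", ["solitude", "guidance", "endurance"]))
```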
How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions
Brack, Manuel, Katakol, Sudeep, Friedrich, Felix, Schramowski, Patrick, Ravi, Hareesh, Kersting, Kristian, Kale, Ajinkya
Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image captions are crucial to a model's performance. Given the noisiness and inconsistency of web-scraped datasets, recent work has shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, current literature does not provide any insights into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions introduce significant shifts in the output bias of a trained model. Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.
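The randomized-length strategy is simple to picture: per training sample, draw a target length and pick the closest of several pre-generated captions. A minimal sketch, assuming candidate captions were produced offline at several verbosity levels (the length range is illustrative):

```python
# A minimal sketch of randomized-length caption sampling. Assumes a captioner
# has produced candidates at several verbosity levels offline; the 5-60 word
# target range is an illustrative choice, not the paper's setting.
import random

def pick_caption(candidates: list[str], rng: random.Random) -> str:
    """candidates: captions of the same image at different verbosity levels.
    Sampling a fresh target word count varies caption density per epoch."""
    target = rng.randint(5, 60)
    return min(candidates, key=lambda c: abs(len(c.split()) - target))

rng = random.Random(0)
candidates = [
    "a cat on a sofa",
    "a ginger cat curled up on a grey sofa near a window",
    "a fluffy ginger cat sleeping on the armrest of a grey fabric sofa, "
    "soft afternoon light coming through a nearby window, cozy living room",
]
for _ in range(3):
    print(pick_caption(candidates, rng))
```

Varying the target per sample is what lets one image contribute both dense and sparse supervision across epochs, which matches the paper's finding that randomized lengths balance alignment against aesthetics and diversity.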